class: center, middle, inverse, title-slide

.title[
# Sequence Alignment Map
]

.author[
### James Ashmore • 23-Sep-2022
]

.institute[
### Zifo RnD Solutions
]

---
exclude: true
count: false

<link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.3.1/css/all.css" integrity="sha384-mzrmE5qonljUremFsqc01SB46JvROS7bZs3IO2EmfFsd15uHvIt+Y8vEf7N7fWAU" crossorigin="anonymous">

<!-- ------------ Only edit title, subtitle & author above this ------------ -->

---

# The SAM format

## Overview

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/sam/wgsim/reads.sam.png" alt="Fig 1: Example of SAM file format" width="100%" />
<p class="caption">Fig 1: Example of SAM file format</p>
</div>

* SAM stands for **S**equence **A**lignment **M**ap format
* Stores biological sequences aligned to a reference sequence
* TAB-delimited text format consisting of a **header** and **alignment** section
* Each alignment has **11** mandatory fields for essential alignment information
* The full SAM format specification is available [online](http://samtools.github.io/hts-specs/SAMv1.pdf)
* Developed by [Heng Li](https://www.broadinstitute.org/bios/heng-li) and [Bob Handsaker](https://www.broadinstitute.org/bios/bob-handsaker) et al. (2009)

---

# The SAM format

## Header

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/sam/wgsim/reads.header.png" alt="Fig 2: Example of SAM header" width="100%" />
<p class="caption">Fig 2: Example of SAM header</p>
</div>

* Each line begins with an `@` character followed by a record type:
  * File-level metadata `HD`
  * Reference sequence dictionary `SQ`
  * Read group `RG`
  * Program `PG`
  * Comment `CO`
* Each record type contains different record tags:
  * For example, `SQ` contains reference sequence name `SN`, length `LN`, etc.
* The full list of record types and their tags is available [online](https://samtools.github.io/hts-specs/SAMv1.pdf#page=3)

---

# The SAM format

## Alignments

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/sam/wgsim/reads.alignment.png" alt="Fig 3: Example of SAM alignments" width="100%" />
<p class="caption">Fig 3: Example of SAM alignments</p>
</div>

* Each line typically represents the linear alignment of a segment
* Each alignment consists of 11 or more TAB-separated fields:
  * `QNAME` `FLAG` `RNAME` `POS` `MAPQ` `CIGAR` `RNEXT` `PNEXT` `TLEN` `SEQ` `QUAL`

---

# The SAM format

## Alignments

<table class="table" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Col </th> <th style="text-align:left;"> Field </th> <th style="text-align:left;"> Type </th> <th style="text-align:left;"> Description </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> QNAME </td> <td style="text-align:left;"> String </td> <td style="text-align:left;"> Query template NAME </td> </tr>
<tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> FLAG </td> <td style="text-align:left;"> Int </td> <td style="text-align:left;"> bitwise FLAG </td> </tr>
<tr> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> RNAME </td> <td style="text-align:left;"> String </td> <td style="text-align:left;"> Reference sequence NAME </td> </tr>
<tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> POS </td> <td style="text-align:left;"> Int </td> <td style="text-align:left;"> 1-based leftmost mapping POSition </td> </tr>
<tr> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> MAPQ </td> <td style="text-align:left;"> Int </td> <td style="text-align:left;"> MAPping Quality </td> </tr>
<tr> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> CIGAR </td> <td
style="text-align:left;"> String </td> <td style="text-align:left;"> CIGAR string </td> </tr>
<tr> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> RNEXT </td> <td style="text-align:left;"> String </td> <td style="text-align:left;"> Reference name of the mate/next read </td> </tr>
<tr> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> PNEXT </td> <td style="text-align:left;"> Int </td> <td style="text-align:left;"> Position of the mate/next read </td> </tr>
<tr> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> TLEN </td> <td style="text-align:left;"> Int </td> <td style="text-align:left;"> observed Template LENgth </td> </tr>
<tr> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> SEQ </td> <td style="text-align:left;"> String </td> <td style="text-align:left;"> segment SEQuence </td> </tr>
<tr> <td style="text-align:left;"> 11 </td> <td style="text-align:left;"> QUAL </td> <td style="text-align:left;"> String </td> <td style="text-align:left;"> ASCII of Phred-scaled base QUALity+33 </td> </tr>
</tbody>
</table>

<center>Tab 1: Alignment fields</center>

---

# The SAM fields

## QNAME (1)

.pull-left-60[

* Reads with an identical QNAME come from the same template
* `*` indicates the information is unavailable
* Reads may occupy multiple alignment lines when:
  * The alignment is chimeric
  * There are multiple mappings

]

.pull-right-40[

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/sam/wgsim/reads.alignment.QNAME.png" alt="Fig 4: Query template NAME" width="35%" />
<p class="caption">Fig 4: Query template NAME</p>
</div>

]

---

# The SAM fields

## FLAG (2)

.pull-left-60[

* The FLAG encodes attributes of a read alignment
* It is displayed as a single integer code:
  * READ MAPPED TO FORWARD STRAND = 0
  * READ UNMAPPED = 4
  * READ MAPPED TO REVERSE STRAND = 16
  * READ FAILED QUALITY CONTROL = 512
* Integers are the sum of **bitwise** flags
* A bitwise flag encodes a specific
**attribute**
* The attributes are summed to get the final value

]

.pull-right-40[

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/sam/wgsim/reads.alignment.FLAG.png" alt="Fig 5: Combination of bitwise FLAGs" width="35%" />
<p class="caption">Fig 5: Combination of bitwise FLAGs</p>
</div>

]

---

# The SAM fields

## FLAG (2)

.pull-left-50[

<table class="table" style="font-size: 13px; width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Integer </th> <th style="text-align:left;"> Binary </th> <th style="text-align:left;"> Description </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 000000000001 </td> <td style="text-align:left;"> Read paired </td> </tr>
<tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 000000000010 </td> <td style="text-align:left;"> Read mapped in proper pair </td> </tr>
<tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 000000000100 </td> <td style="text-align:left;"> Read unmapped </td> </tr>
<tr> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 000000001000 </td> <td style="text-align:left;"> Mate unmapped </td> </tr>
<tr> <td style="text-align:left;"> 16 </td> <td style="text-align:left;"> 000000010000 </td> <td style="text-align:left;"> Read reverse strand </td> </tr>
<tr> <td style="text-align:left;"> 32 </td> <td style="text-align:left;"> 000000100000 </td> <td style="text-align:left;"> Mate reverse strand </td> </tr>
<tr> <td style="text-align:left;"> 64 </td> <td style="text-align:left;"> 000001000000 </td> <td style="text-align:left;"> First in pair </td> </tr>
<tr> <td style="text-align:left;"> 128 </td> <td style="text-align:left;"> 000010000000 </td> <td style="text-align:left;"> Second in pair </td> </tr>
<tr> <td style="text-align:left;"> 256 </td> <td style="text-align:left;"> 000100000000 </td> <td style="text-align:left;"> Not
primary alignment </td> </tr>
<tr> <td style="text-align:left;"> 512 </td> <td style="text-align:left;"> 001000000000 </td> <td style="text-align:left;"> Read fails platform/vendor quality checks </td> </tr>
<tr> <td style="text-align:left;"> 1024 </td> <td style="text-align:left;"> 010000000000 </td> <td style="text-align:left;"> Read is PCR or optical duplicate </td> </tr>
<tr> <td style="text-align:left;"> 2048 </td> <td style="text-align:left;"> 100000000000 </td> <td style="text-align:left;"> Supplementary alignment </td> </tr>
</tbody>
</table>

<center>Tab 2: bitwise FLAGs</center>

]

.pull-right-50[

<table class="table" style="font-size: 13px; width: auto !important; margin-left: auto; margin-right: auto;">
<caption style="font-size: initial !important;">Ex 1: An unmapped read that failed quality control</caption>
<thead>
<tr> <th style="text-align:left;"> Flag </th> <th style="text-align:left;"> Meaning </th> <th style="text-align:left;"> Sum </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> Read unmapped </td> <td style="text-align:left;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> 512 </td> <td style="text-align:left;"> Read fails QC </td> <td style="text-align:left;"> 516 </td> </tr>
</tbody>
</table>

<table class="table" style="font-size: 13px; width: auto !important; margin-left: auto; margin-right: auto;">
<caption style="font-size: initial !important;">Ex 2: A supplementary alignment from the first read in a pair where the mate is aligned to the reverse strand</caption>
<thead>
<tr> <th style="text-align:left;"> Flag </th> <th style="text-align:left;"> Meaning </th> <th style="text-align:left;"> Sum </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> Read paired </td> <td style="text-align:left;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> 32 </td> <td style="text-align:left;"> Mate reverse strand </td> <td style="text-align:left;"> 33
</td> </tr>
<tr> <td style="text-align:left;"> 64 </td> <td style="text-align:left;"> First in pair </td> <td style="text-align:left;"> 97 </td> </tr>
<tr> <td style="text-align:left;"> 2048 </td> <td style="text-align:left;"> Supplementary alignment </td> <td style="text-align:left;"> 2145 </td> </tr>
</tbody>
</table>

]

---

# The SAM fields

## RNAME (3)

.pull-left-60[

* Reference sequence name of the alignment
* An unmapped read without a coordinate has a `*` in this field

]

.pull-right-40[

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/sam/wgsim/reads.alignment.RNAME.png" alt="Fig 6: Reference sequence NAME" width="35%" />
<p class="caption">Fig 6: Reference sequence NAME</p>
</div>

]

---

# The SAM fields

## POS (4)

.pull-left-60[

* 1-based leftmost mapping position of the read
* First base in a reference sequence has coordinate 1
* POS is set to 0 for an unmapped read without a coordinate

]

.pull-right-40[

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/sam/wgsim/reads.alignment.POS.png" alt="Fig 7: 1-based leftmost mapping POSition" width="35%" />
<p class="caption">Fig 7: 1-based leftmost mapping POSition</p>
</div>

]

---

# The SAM fields

## MAPQ (5)

.pull-left-60[

* Phred-scaled probability of an **incorrect** mapping position, rounded to the nearest integer
* MAPQ calculation: `\(Q=-10\log_{10}P\)`

<br>

<table class="table" style="font-size: 15px; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> Mapping quality </th> <th style="text-align:right;"> Probability of incorrect mapping position </th> <th style="text-align:right;"> Mapping position accuracy </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 1 in 10 </td> <td style="text-align:right;"> 90% </td> </tr>
<tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 1 in 100 </td> <td style="text-align:right;"> 99% </td> </tr>
<tr> <td style="text-align:right;"> 30
</td> <td style="text-align:right;"> 1 in 1,000 </td> <td style="text-align:right;"> 99.9% </td> </tr>
<tr> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 1 in 10,000 </td> <td style="text-align:right;"> 99.99% </td> </tr>
<tr> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 1 in 100,000 </td> <td style="text-align:right;"> 99.999% </td> </tr>
<tr> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 1 in 1,000,000 </td> <td style="text-align:right;"> 99.9999% </td> </tr>
</tbody>
</table>

<center>Tab 3: Mapping quality</center>

]

.pull-right-40[

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/sam/wgsim/reads.alignment.MAPQ.png" alt="Fig 8: MAPping Quality" width="35%" />
<p class="caption">Fig 8: MAPping Quality</p>
</div>

]

---

# The SAM fields

## CIGAR (6)

.pull-left-60[

* Represents **how** the read aligned to the reference
* Uses a length followed by a character to represent each operation

```default
# M Alignment match (base may match or mismatch)
# N Skipped reference region
# D Deletion
# I Insertion
```

* An alignment with POS = `3` and CIGAR = `6M`

```default
# REF: AAGTCTAGAA
# SEQ:   GTCTAG
```

* An alignment with POS = `3` and CIGAR = `3M2I3M`

```default
# REF: AAGTC--TAGAA
# SEQ:   GTCGATAG
```

* An alignment with POS = `3` and CIGAR = `2M1D3M`

```default
# REF: AAGTCTAGAA
# SEQ:   GT-TAG
```

]

.pull-right-40[

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/sam/wgsim/reads.alignment.CIGAR.png" alt="Fig 9: CIGAR string" width="35%" />
<p class="caption">Fig 9: CIGAR string</p>
</div>

]

---

# The SAM fields

## RNEXT (7), PNEXT (8), TLEN (9)

.pull-left-60[

* RNEXT: reference sequence name of the primary alignment of the mate from a paired-end read
* PNEXT: mapping position of the mate from a paired-end read
* TLEN: template length calculated from the mapping positions of both mates

]

.pull-right-40[

<div class="figure" style="text-align: center">
<img
src="data:image/png;base64,#data/sam/wgsim/reads.alignment.RNEXT.PNEXT.TLEN.png" alt="Fig 10: RNEXT, PNEXT, TLEN" width="45%" />
<p class="caption">Fig 10: RNEXT, PNEXT, TLEN</p>
</div>

]

---

# The SAM fields

## SEQ (10), QUAL (11)

.pull-left-60[

* SEQ: the sequence of the read
* QUAL: the Phred-scaled quality scores of the read base calls, stored as ASCII characters (score + 33)

]

.pull-right-40[

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/sam/wgsim/reads.alignment.SEQ.png" alt="Fig 11: Read sequence" width="85%" />
<p class="caption">Fig 11: Read sequence</p>
</div>

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/sam/wgsim/reads.alignment.QUAL.png" alt="Fig 12: Phred quality score" width="85%" />
<p class="caption">Fig 12: Phred quality score</p>
</div>

]

---

# The Samtools suite

## Overview

* Collection of programs for interacting with high-throughput sequencing data
* Consists of three separate repositories:
  1. Samtools
  2. BCFtools
  3. HTSlib
* Samtools is used to read and write SAM/BAM files
* The Samtools manual is available [online](https://www.htslib.org/doc/samtools.html)
* Developed by [Heng Li](https://www.broadinstitute.org/bios/heng-li) and [Bob Handsaker](https://www.broadinstitute.org/bios/bob-handsaker) et al.
(2009)
* Alternative tools for SAM/BAM files include [BAMtools](https://github.com/pezmaster31/bamtools) and [Picard](https://github.com/broadinstitute/picard)

---

# The Samtools suite

.pull-left-50[

## Samtools commands

* Contains 36 different commands
* Grouped into different categories:
  * Indexing
  * Editing
  * File operations
  * Statistics
  * Viewing
  * Miscellaneous
* Convert SAM to a compressed binary format called BAM
* Index a sorted BAM file to speed up retrieving alignments

]

.pull-right-50[

<img src="data:image/png;base64,#data/sam/samtools-help.png" width="80%" style="display: block; margin: auto;" />

]

---

# Summary

- SAM stands for Sequence Alignment Map format
- Stores biological sequences aligned to a reference sequence
- TAB-delimited text format consisting of a header and alignment section
- Each header line has a record type and tags that hold file-level metadata
- Each alignment has 11 mandatory fields for essential alignment information
- BAM is the compressed binary version of the SAM format
- Samtools is used to read, write, and manipulate SAM/BAM files

---

# Resources

## Specification

* [Sequence Alignment/Map Format](http://samtools.github.io/hts-specs/SAMv1.pdf)

## Manual

* [Samtools](https://www.htslib.org/doc/samtools.html)

## Papers

* [Twelve years of SAMtools and BCFtools](https://pubmed.ncbi.nlm.nih.gov/33590861) by Petr Danecek et al.
* [The Sequence Alignment/Map format and SAMtools](http://www.ncbi.nlm.nih.gov/pubmed/19505943) by Heng Li et al.

## Videos

* [Lockdown Learning: SAMtools](https://youtu.be/Llaxuzr6EkA) by Simon Cockell
* [Bioinformatics Coffee Hour: Samtools](https://youtu.be/Z4A2LCPyVU4) by Harvard FAS Informatics
* [SAM, BAM, CRAM format](https://youtu.be/XU8atPxM0VQ) by Aaron Quinlan

<!-- --------------------- Do not edit this and below --------------------- -->

---
name: end_slide
class: end-slide, middle
count: false

# Thank you. Questions?

.end-text[ <p class="smaller"> <span class="small" style="line-height: 1.2;">Graphics from </span><img src="./assets/freepik.jpg" style="max-height:20px; vertical-align:middle;"><br> Created: 23-Sep-2022 • James Ashmore • <a href="https://www.zifornd.com/category/omics-bioinformatics">Bioinformatics</a> • <a href="https://www.zifornd.com">Zifo</a> </p> ]
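
---

# Appendix

## Reading a SAM record in Python

The field slides describe header tags, bitwise FLAGs, the MAPQ formula and CIGAR operations. The block below is a minimal sketch of how those pieces might be read with the Python standard library only, assuming plain-text, uncompressed SAM lines. The function names (`parse_header_line`, `parse_alignment_line`, `decode_flag`, `mapq_to_error_probability`, `reference_span`), the `text` key used for comment lines, and the example records are hypothetical and written for illustration; they are not part of Samtools or any existing library.

```python
import re

# Descriptions for each bit, copied from Tab 2 (bitwise FLAGs).
FLAG_BITS = {
    1: "Read paired",
    2: "Read mapped in proper pair",
    4: "Read unmapped",
    8: "Mate unmapped",
    16: "Read reverse strand",
    32: "Mate reverse strand",
    64: "First in pair",
    128: "Second in pair",
    256: "Not primary alignment",
    512: "Read fails platform/vendor quality checks",
    1024: "Read is PCR or optical duplicate",
    2048: "Supplementary alignment",
}

# The 11 mandatory fields, in column order (Tab 1).
FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_header_line(line):
    """Return (record type, {tag: value}) for a header line such as '@SQ'."""
    record_type, *fields = line.rstrip("\n").split("\t")
    if not record_type.startswith("@"):
        raise ValueError("header lines begin with '@'")
    if record_type == "@CO":
        # Comment lines carry free text, not TAG:VALUE pairs.
        return "CO", {"text": "\t".join(fields)}
    return record_type[1:], dict(field.split(":", 1) for field in fields)


def parse_alignment_line(line):
    """Return the 11 mandatory fields of an alignment line as a dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(FIELDS):
        raise ValueError("an alignment needs at least 11 TAB-separated fields")
    record = dict(zip(FIELDS, columns))  # optional fields beyond 11 are ignored
    for name in INT_FIELDS:
        record[name] = int(record[name])
    return record


def decode_flag(flag):
    """List the attributes whose bits are set in FLAG."""
    return [text for bit, text in FLAG_BITS.items() if flag & bit]


def mapq_to_error_probability(mapq):
    """Invert Q = -10 * log10(P) to recover P from a MAPQ value."""
    return 10 ** (-mapq / 10)


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # M, D, N, = and X advance along the reference; I, S, H and P do not.
    return sum(int(length) for length, op in operations if op in "MDN=X")


if __name__ == "__main__":
    # Hypothetical records, written only for this example.
    print(parse_header_line("@SQ\tSN:chr1\tLN:1000"))
    record = parse_alignment_line(
        "read1\t16\tchr1\t3\t60\t2M1D3M\t*\t0\t0\tGTTAG\tIIIII")
    print(decode_flag(record["FLAG"]))
    print(mapq_to_error_probability(record["MAPQ"]))
    print(reference_span(record["CIGAR"]))
```

With these definitions, `decode_flag(516)` returns the two attributes listed in Ex 1 and `decode_flag(2145)` the four listed in Ex 2, while `reference_span("2M1D3M")` gives 6 because the deleted base still occupies a reference position. For real data, an established parser built on HTSlib is likely a better choice than hand-written splitting.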